Guiding question:

Which chemical properties influence the quality of red wines?


Univariate Exploration and Analysis

## [1] 1599   14
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "quality.factor"
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

data set description

table(rw$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The vast majority of the wines were rated 5 or 6, and each rating further away from 5 and 6 has roughly an order of magnitude fewer wines.
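The size of the drop-off can be checked from the counts in the table above. The sketch below re-enters those counts by hand (the names `quality_counts` and `share_5_6` are only for illustration) and views them on a log10 scale, where a gap of 1 between two ratings would be a full order of magnitude.

```r
# Counts copied from the table(rw$quality) output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of all wines rated 5 or 6, as a percentage.
share_5_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
round(100 * share_5_6, 1)

# Counts on a log10 scale: a gap of 1 is one order of magnitude.
round(log10(quality_counts), 2)
```

The gaps between neighbouring ratings come out between about 0.5 and 1.1, so "roughly an order of magnitude" is an approximation rather than an exact rule.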

Overall acidity, alcohol level, and fixed acidity are all approximately normally distributed, with positive skewness in alcohol and fixed acidity.

Citric acid in a large number of wines is 0 and the distribution is relatively flat.

Acid summaries (Fixed, volatile)

summary(rw$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
summary(rw$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The two measures of acidity are distributed differently. Fixed acidity is positively skewed, while volatile acidity is less strongly skewed but has some positive outliers.
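One rough way to compare the skew of the two variables from the summaries alone is to look at how far the mean sits above the median, scaled by the interquartile range. This is only an informal indicator, not a formal skewness statistic, and the names `acid` and `skew_hint` are made up for this sketch.

```r
# Values copied from the two summary() outputs above.
acid <- data.frame(
  q1     = c(7.10, 0.39),
  median = c(7.90, 0.52),
  mean   = c(8.32, 0.5278),
  q3     = c(9.20, 0.64),
  row.names = c("fixed", "volatile")
)

# (mean - median) / IQR: larger positive values hint at a longer right tail.
skew_hint <- (acid$mean - acid$median) / (acid$q3 - acid$q1)
names(skew_hint) <- rownames(acid)
round(skew_hint, 2)
```

Both values are positive, and the one for fixed acidity is several times larger, which is in line with what the histograms suggested.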

summary(rw$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

There are several major outliers in the chloride measurements, but around the mean and median the distribution is approximately normal.

Both measures of sulfur dioxide (SO2) are positively skewed. According to the description of the data “SO2 concentrations over 50 ppm [are] evident in the nose and taste of wine.” It may be interesting later to see what effect this evident taste has on a wine’s rating.

# Number of entries with total SO2 at or above 50 ppm.
nrow(subset(rw, total.sulfur.dioxide >= 50))
## [1] 557
557 / 1599
## [1] 0.3483427

Out of the 1599 entries, 557 have total SO2 at or above 50 ppm. This is about 34.8% of the entries.
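The count and the share can each be computed in one step, because the mean of a logical vector is the proportion of `TRUE` values. The sketch below uses only the first ten `total.sulfur.dioxide` values shown in the `str()` output, so its numbers describe those ten rows and not the full data set; `so2_head` and `at_or_above` are illustrative names.

```r
# First ten total.sulfur.dioxide values, copied from the str() output above.
so2_head <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

at_or_above <- so2_head >= 50
sum(at_or_above)   # number of entries at or above 50
mean(at_or_above)  # proportion of entries at or above 50
```

Applied to the whole data frame, the same pattern would be `mean(rw$total.sulfur.dioxide >= 50)`.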

The residual sugar distribution is approximately normal with outliers.
This seems to be a recurring pattern. Later it may be interesting to compare the tails with each other to see if there is a correlation between the extremes and good or bad wine.

Density is tightly distributed with a normal distribution showing no apparent skewness.

summary(rw$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Positively skewed with extreme positive outliers. Also, the entries are fairly spread out. According to the data file sulphates “can contribute to sulfur dioxide gas”.

Both do have similar tails, but the total sulfur dioxide is not as normal as the sulphates.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

Many of the variables are approximately normally distributed. With the exception of density and alcohol, the distributions are skewed or have extreme outliers. The interesting thing about this is that several of the distributions appear to be skewed in the same direction. Upon further investigation it also seems that some of them are related chemically (sulphates and SO2).

These variables should be of interest when comparing multiple variables, so that we can test whether they are actually correlated (and how the pair of variables affects quality).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The guiding question is about the quality of wines. All other measurements are about the chemical or physical characteristics of the wines. Therefore, in order to answer the question, the variables will have to be compared to quality individually and eventually in groups. Overall, understanding each measurement’s effect on a wine and then comparing how that affects the rated quality will be very important in the following (bivariate and multivariate) analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The ratings were all very tightly packed, and I had a feeling the number of wines in each rating would drop by roughly an order of magnitude with every rating higher or lower than 5 or 6. This was shown to be approximately true by changing the y-axis from continuous to a log10 scale.

The first chlorides graph I made was very spread out with noticeable outliers, but also signs of a tail. In order to investigate further I changed the x-axis from continuous to a log10 scale. The new graph was better, but there did not appear to be a tail as I had expected, and the majority of the data seemed distributed around a fairly small section. Therefore, I decided to simply remove the outliers (anything above the 98th percentile) and keep the x-axis continuous. The final graph shows what I had expected after seeing the log graph: normally distributed around a small range with a trickle of outliers that did not represent a significant positive skewness.
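The trimming step described above can be sketched as follows. The data here are synthetic (a tight cluster plus a few large values, generated only to mimic the shape described), so the numbers say nothing about the real chlorides column; `x`, `cutoff`, and `trimmed` are illustrative names.

```r
set.seed(1)

# Synthetic stand-in for a chlorides-like variable: a tight main cluster
# plus a handful of large outliers. These are NOT the wine measurements.
x <- c(rnorm(980, mean = 0.08, sd = 0.01), runif(20, min = 0.2, max = 0.6))

# Keep only the values at or below the 98th percentile.
cutoff  <- quantile(x, probs = 0.98)
trimmed <- x[x <= cutoff]

c(before = length(x), after = length(trimmed))
range(trimmed)
```

The same two lines (`quantile()` followed by a logical subset) would apply to the 99th percentile cuts used for total SO2 and residual sugar by changing `probs`.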

Total sulfur dioxide (SO2) was positively skewed, similar to free SO2, but had a couple of extreme outliers. To understand the main distribution of total SO2 I removed the values that were greater than the 99th percentile.

Residual sugars were adjusted by removing the values greater than the 99th percentile and by log transforming the x-axis (done separately). The log and quantile graphs showed the same distribution for the most part, but the quantile graph was easier to understand because of the uniform breaks.

Sulphates had extreme outliers that were removed so that it was easier to understand the ranges within the main distribution.


Bivariate Analysis

No single variable has a clear correlation with quality (e.g., correlation > 0.5).

As quality goes up the sulphates seem to increase slightly, but there is a lot of overlap between the distributions.

It appears that the percent of the SO2 that is free doesn’t have a noticeable effect on quality.
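The "percent of SO2 that is free" is a derived variable: free SO2 divided by total SO2. A minimal sketch of building it, using only the first ten rows visible in the `str()` output (the column name `pct.free.so2` and the data frame name `head10` are assumptions made for this example):

```r
# First ten rows of the relevant columns, copied from the str() output above.
head10 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Percent of total SO2 that is free (hypothetical column name).
head10$pct.free.so2 <- 100 * head10$free.sulfur.dioxide / head10$total.sulfur.dioxide
round(head10$pct.free.so2, 1)
```

The new column can then be plotted against quality in the same way as any of the original measurements.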

The data at the extreme quality ratings are rather limited, which makes it hard to pick up on possible trends or correlations. Looking at the graphs, the citric acid distributions for lower-quality wines appear to be shifted further left than those for higher-quality wines, but there are exceptions (i.e., wines rated 7 and 8).

This shows the best correlation between a variable and quality so far. It is far from perfect. For instance, in the lower rankings increasing alcohol does not go along with a better ranking. In the multivariate section I should see whether another variable combined with a respectable amount of alcohol makes for a bad wine.

Chlorides may be that other variable I test against.

cor.test(rw$alcohol, rw$citric.acid)
## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and rw$citric.acid
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

Both graphs have a blob-like scatter. There is no clear trend between just the two variables. It would be interesting to see if a third variable would order or separate these two variables.

cor.test(rw$volatile.acidity, rw$citric.acid)
## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and rw$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

This will be a must-explore with quality colored in. Volatile acidity adds a bad vinegar taste in high amounts, and citric acid in low (but existent) amounts adds freshness. I would hypothesize that the points on the left will be ranked far lower than the points on the right.
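The two `cor.test` outputs above can be cross-checked by hand, since for a Pearson correlation the test statistic is t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the reported correlations and degrees of freedom, and also shows r^2 as a percentage to give a feel for how much variance each pair shares. The vector names are illustrative only.

```r
# Correlations and degrees of freedom copied from the cor.test output above.
r  <- c(alcohol_citric = 0.1099032, volatile_citric = -0.5524957)
df <- 1597

# t statistic implied by each correlation.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)

# Shared variance (r squared), as a percentage.
round(100 * r^2, 1)
```

Both t values match the printed output. The alcohol and citric acid pair shares only about 1% of its variance despite the very small p-value, while volatile acidity and citric acid share about 30%.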

It appears that volatile acidity has an effect on quality. As the quality increases, the peak of each distribution moves right (volatile acidity decreases).

If a wine has less than 50 ppm total SO2, then the wine has a noticeably higher quality rating.
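A comparison like this comes down to splitting the wines at the 50 ppm threshold and summarizing quality within each group. The sketch below shows the mechanics on the first ten rows from the `str()` output only; ten rows are far too few to support any conclusion, and the names `head10`, `so2.group`, and `group_means` are made up for the example.

```r
# First ten rows of the two columns, copied from the str() output above.
head10 <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Label each wine by which side of the 50 ppm threshold it falls on.
head10$so2.group <- ifelse(head10$total.sulfur.dioxide < 50, "below 50", "50 or more")

# Mean quality and number of wines in each group.
group_means <- tapply(head10$quality, head10$so2.group, mean)
group_means
table(head10$so2.group)
```

Printing the group sizes next to the means is a cheap safeguard, because a difference in means is much less convincing when one group is small.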

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As sulphates increased there seemed to be an increase in the quality. Although it wasn’t perfect there did seem to be a noticeable trend.

As alcohol increased there was an increase in quality. This relationship was much more pronounced than for sulphates, and there were fewer extreme outliers in the alcohol distributions than in the sulphates distributions.

The less volatile acidity in the wine, the better the rating it got. This is not universal, since there are overlaps between the quality distributions, but it is noticeable.

If a wine had less than 50 ppm of total SO2, it was likely to have a much better ranking.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As citric acid increased there was a decrease in volatile acidity. Therefore, I thought that there would be a correlation between citric acid and quality. However, there was no correlation between citric acid and quality.

I wonder if the wines with low volatile acidity that were not ranked high were due to high citric acid (possibly too high, thus driving down the volatile acidity artificially).

What was the strongest relationship you found?

It is a tie between alcohol’s and volatile acidity’s relationship with quality.


Multivariate Analysis

Nothing stands out in the graphs above.

There doesn’t seem to be any grouping between volatile and fixed acidity. In addition, there is no clear pattern with relation to quality in the relationship either. Again, all that is clear is that wines with low volatile acidity are ranked better (the dashed line is the mean for volatile acidity).

There seems to be a grouping of blue that has moderate amounts of citric acid and low volatile acidity.

Each dashed line is the mean. Nothing new. Wines with low volatile acidity are rated better, but we already knew that.

Nothing really useful here. The blue is citric acid in the top quartile and volatile acidity in the bottom quartile. Purple is citric acid in the bottom half and volatile acidity in the top half.
I was hoping to see a relationship between the two affecting quality, but because I’m taking smaller and smaller subsets I have no idea if it is significant when I can only compare it overall.
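The grouping described above, and the worry about shrinking subsets, can be sketched with quantile cutoffs and a count per group. The data below are synthetic uniform draws generated only to make the example runnable, so the group sizes are not those of the wine data; `toy`, `q_citric`, `q_vol`, and the group labels are illustrative.

```r
set.seed(42)

# Synthetic stand-in data. These are NOT the wine measurements.
toy <- data.frame(
  citric.acid      = runif(200, min = 0, max = 1),
  volatile.acidity = runif(200, min = 0.1, max = 1.6)
)

# Median and upper quartile of citric acid; lower quartile and median of volatile acidity.
q_citric <- quantile(toy$citric.acid, probs = c(0.5, 0.75))
q_vol    <- quantile(toy$volatile.acidity, probs = c(0.25, 0.5))

toy$group <- "other"
# Blue: citric acid in the top quartile, volatile acidity in the bottom quartile.
toy$group[toy$citric.acid >= q_citric[2] & toy$volatile.acidity <= q_vol[1]] <- "blue"
# Purple: citric acid in the bottom half, volatile acidity in the top half.
toy$group[toy$citric.acid <= q_citric[1] & toy$volatile.acidity >= q_vol[2]] <- "purple"

# Group sizes show how small the subsets become.
table(toy$group)
```

Checking the counts first gives a sense of whether a subset is large enough for any visible pattern in it to be worth following up.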

There seems to be a grouping of blue with high alcohol and low volatile acidity.

As sulphates and alcohol increase there appears to be an increase in quality.

Nothing of interest.

Nothing of interest.

Nothing of interest. The horizontal line is the mean value of sulphates.

Here is a plot of sulphates and alcohol, but only for wines that have low volatile acidity. It drastically reduces the number of bad and mediocre wines.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol was a consistent factor that improved quality. In addition, with low volatile acidity and high sulphates there was a stronger relationship with quality.

Were there any interesting or surprising interactions between features?

I was expecting free SO2 to have more of an effect because of its effect on taste, but it seems that total SO2 had more of an effect.


Final Plots and Summary

Plot One

Description One

The majority of the wines are rated 5 or 6. The number of wines with any other rating drops by roughly an order of magnitude the farther the rating gets from 5 or 6.

Plot Two

Description Two

As alcohol increases there is a trend of the wines improving in quality. The higher quality wines are a much larger proportion of the total number of wines when looking at wines with alcohol greater than the mean (to the right of the black dashed line). Also, almost all the wines with alcohol in the top 15% (to the right of the red dashed line) are high quality wines.
Perhaps this is because the longer a wine matures the more time yeast has to turn sugars into alcohol. This may mean that older wines are generally rated higher.

Plot Three

Description Three

The dashed lines are the mean alcohol level and mean sulphates. Wines with sulphates and alcohol above the mean are rated better than wines with sulphates and alcohol below the mean. Each variable individually had an effect on quality, and it is clear that when combined both have an even greater effect.


Reflection

I started by looking at what was included in the dataset, what each variable was, and how the variables affected wine. Next, I examined the distributions of the different variables. From this I found that a lot of the data was normally distributed with either a positive skew or extreme outliers. Then I looked at how two variables affected one another, only to find no clear correlation between variables I thought would be related. Therefore, I moved on and looked at the variables’ distributions overlaid with quality. This produced some interesting results because it was possible to see what was previously a normal distribution become a distribution that showed quality following a trend. Using what I found in the bivariate section, I next explored how two variables together had an effect on quality. I was hoping to find new interactions between variables that would have a clear effect on quality. I found that the variables I knew had an effect on quality worked together to make the improvements in quality more noticeable, and the variables I thought might work together showed no sign of improving quality.

There are limitations to the data. There are only 1599 observations, and the number of wines with a quality other than 5 or 6 is very low, so trying to find patterns that differentiate wines based on quality is hard. In addition, the wines are only of one type, Portuguese “Vinho Verde” wine, which means that the results found here cannot be applied to all red wines in general. Finally, there are other variables not included that would be useful, such as length of fermentation, grape type, and the year the grapes were grown. Overall, more observations, and observations that were not purely chemical, would help make this data set better.